\documentclass{sig-alternate}
%\documentclass{sig-alternate-nocopyright}

%-- Begin patch area for accents in 'Author Block' area -
%   may be needed by some authors / but not all

% Needed for "Author Block" accents - Patch by Gerry 3/21/07
\DeclareFixedFont{\auacc}{OT1}{phv}{m}{n}{12}   

% Needed for "Author Block" accents in the affiliation/address line -
% Patch by Gerry 3/21/07
\DeclareFixedFont{\afacc}{OT1}{phv}{m}{n}{10}   
%--

\begin{document}
%
% --- Author Metadata here ---
%\conferenceinfo{WOODSTOCK}{'97 El Paso, Texas USA}
\CopyrightYear{2007}
% Allows default copyright data (0-89791-88-6/97/05) to be over-ridden -
% IF NEED BE.
%\crdata{0-12345-67-8/90/01}  
% --- End of Author Metadata ---

\title{An Algorithm For Discovering Similar Blogs
\titlenote{Copyright is held by the authors}}
%\subtitle{Foo Foo
%\titlenote{A full version of this paper is available as
%\textit{Author's Guide to Preparing ACM SIG Proceedings Using
%\LaTeX$2_\epsilon$\ and BibTeX} at
%\texttt{www.acm.org/eaddress.htm}}}
%

\numberofauthors{3}
\author{
% 1st. author
\alignauthor
Eran Chinthaka \\
       \affaddr{Indiana University Computer Science Department}\\
%       \affaddr{150 S Woodlawn Avenue}\\
%       \affaddr{Bloomington, IN 47405}\\
       \email{echintha@cs.indiana.edu}
% 2nd. author
\alignauthor
Michael D. Conover \\
       \affaddr{Indiana University School of Informatics}\\
%       \affaddr{901 E. 10th St.}\\
%       \affaddr{Bloomington, IN 47408}\\
       \email{midconov@indiana.edu}
% 3rd. author
\alignauthor Michel A. Salim \\
       \affaddr{Indiana University Computer Science Department}\\
%       \affaddr{150 S Woodlawn Avenue}\\
%       \affaddr{Bloomington, IN 47405}\\
       \email{msalim@cs.indiana.edu}
}

\date{26 April 2007}

\maketitle
\begin{abstract}

%We hypothesize that blogs form a web of communities; i.e. that the
%link structure exposes semantic similarity. We have developed a
%framework that rank blogs based on their similarity to a query.
%
% From Mike
Every day 175,000 new blogs are created, added at a rate of two per
second to the sixty-million-plus online journals already chronicling
everything from politics, knitting, and finance to the latest advances
in the sciences and technology~\cite{Disc07}. With this explosive
growth come many unique challenges, chief among them is the
identification of information most relevant to each unique information
consumer.  To address this problem we have created a system which
harnesses the social network-like topology~\cite{herring2005cba} of
the blogosphere to recommend relevant blogs to users based on sites
they already know and enjoy.

Our findings indicate that the algorithm we have proposed outperforms
Kleinberg's HITS algorithm~\cite{kleinberg1999ash} in identifying and
ranking relevant blogs.
%%%

\end{abstract}

% A category with the (minimum) three required fields
%\category{H.4}{Information Systems Applications}{Miscellaneous}
%A category including the fourth, optional field follows...
%\category{D.2.8}{Software Engineering}{Metrics}[complexity measures, performance measures]

%\terms{blogosphere, blogroll, FOAF}

\keywords{blog, blogosphere, blogroll, FOAF}

\section{Introduction}

%Past studies have shown that the World Wide Web is organized into web
%communities, and that semantically similar sites can be identified
%based on their link structures.

%\textbf{[TODO: Mike, have a look at this]} -- We hypothesize that a
%similar organizational structure is present in the blogosphere: that
%blogs are more likely to link to blogs of a similar topic. As such,
%algorithms that have been shown to work on the web in general should
%also work, but we also develop a novel algorithm that is more suited
%to the link patterns observed in blogs.
%
%From Mike
The blogosphere is a highly social place; analysis of its hyperlink
structure demonstrates the emergence of tightly-clustered topical
communities of bloggers with many participants engaged in interactive
discussions of current events~\cite{adamic2005pba,herring2005cba,lin3dbc}. It is
this social property of the blogosphere that motivates the
``friend-of-a-friend''-style recommendation system we have
proposed. Intuitively, the more friends two strangers have in common,
the more likely it is that they share common tastes and interests.
Likewise, because the blogosphere exhibits many properties of a social
network, it is reasonable to posit that the more neighbors two blogs
have in common, the more they share some common appeal.
%%%

%\section{Related Work}
%
%Gibson, Kleinberg and Raghavan~\cite{gibson1998iwc} studied the link
%structure of Internet sites and how they can be used to infer the
%existence of communities of related sites.
%
%Dean and Henzinger~\cite{dean1999frp} described two algorithms (one
%derived from HITS) that use link information to identify related web
%pages.
%
\section{Graph of the Blogosphere}
In addition to the regular hyperlinks present in posts, a common
feature of many blogs is the ``blogroll'', a set of
links to other sites that the author has identified as interesting and
enjoyable.  It is this characteristic of blogroll links
that led us to choose them as the primary source for network data in
the development of our recommendation system.

To populate our dataset we utilized information provided by the site
\texttt{blogrolling.com}, which monitors the inbound blogroll
connections to thousands of sites.  We seeded our data collection
efforts with the URLs of the 500 most popular blogs, as identified by
the blog-monitoring site  \\ \texttt{technorati.com}.  The inbound links to each of
these sites according to \texttt{blogrolling.com} were recorded, and the source
URLs were added to the collection queue.  This process was allowed to
continue until we had identified 426,093 links between 52,056 blogs.

\section{Recommendation}
Our FOAF technique allows a user to find blogs that are similar to one
with which he is already familiar.  To accomplish this the algorithm
must first identify a set of of candidate blogs and then rank the
members of that set according to a scoring measure.

For any given blog it is a trivial task for a user to identify the sites
listed in the blogroll as candidates of potential interest.  Ranking
these sites in terms of relevance may be more difficult due to the
sheer size of many blogrolls, however this task is still tractable if
the user visits each site and makes a decision about relevance in that
manner.  To extend this process further, and manually repeat this process for 
each of the blogs in the initial site's blogroll would be prohibitively tedious.  We seek to approximate this process
algorithmically instead.

\subsection{Identifying Candidates}
Our system mimics the user's decision making behavior in the following
way.  For the original blog of interest, $B$, the FOAF algorithm
populates a network with directed edges from a node representing $B$ to
nodes representing all of the blogs listed in $B$'s blogroll.  This set
of nodes is called $D_1$.  This process is then repeated for each of the
blog-nodes in the $D_1$ set, with the resulting nodes becoming members
of another set, $D_2$.  It should be noted that some of the nodes in $D_1$
may be linked from other nodes in $D_1$, and thus have membership in
both $D_1$ and $D_2$.  Through this process the algorithm identifies all of
the ``friends-of-a-friend'' for the original blog, $B$.

\subsection{Ranking Candidates}
We experimented with two measures used for scoring blogs, both based
on the number of mutual neighbors, or friends, that a specific member
of $D_2$, called $D'$, shares with the original blog $B$.  Additionally we 
used Kleinberg's HITS algorithm as a baseline for comparison.

\subsubsection{Raw Score}
The first  measure with which we experimented  simply scores according
to the  number of mutual  neighbors between $D'$  and $B$.  This  score is
normalized between zero and one by dividing by the total number
of possible neighbors.

\begin{equation}
RS(D') = \frac{\lvert Neighbors \rvert}{\lvert Nodes in D_1 \rvert}
\end{equation}

One problem with this approach is that extremely popular blogs with
very high in-degree will naturally tend to have more neighbors in
common with any given node simply due to their high levels of
connectivity.  As a result, this measure leads to an
over-representation of highly popular sites in the result set.

\subsubsection{Community-Weighted Measure}
To address the problem of over-representation of globally 
nodes we decided to try an approach that weighted the original raw
score for a given node by the degree to which it was a member of the
community surrounding the original node B.  For each node this is
approximated by dividing the number of in-links from members $D_1$ and $D_2$
by the total in-degree of that node.

\begin{equation}
S2(D') = RS(D') \times \frac{[Inlinks from D_1 + D_2]}{[ Global In-Degree]}
\end{equation}

\subsubsection{HITS}
To provide a baseline for comparison, we run the Kleinberg HITS
algorithm on the subgraph whose nodes consist of blogs in $D_1$ and
$D_2$.
%\section{Algorithms}
%\subsection{Secret Sauce}
%This algorithm is based on Friend-of-a-Friend (FOAF). The rationale is
%that a blog is likely to cite other blogs (that we call the \emph{D_1}
%set, for distance=1) that share at least a single topic of common
%interest. The blogs that these blogs cite (\emph{D_2}) should thus also
%include those on that shared topic, and the more paths there are to a
%\emph{D_2} blog (through any blog in \emph{D_1}), the more likely this
%blog is to be relevant, since more of the source blog's friend agrees
%on it.

%\subsection{HITS}

%\subsection{HITS with Co-citation and Co-reference}
%This uses the standard HITS implementation, but the initial subgraph
%is enriched with sites that co-refer the same targets and sites that
%are co-cited (a parent site links to both the source site and this
%site)

%\section{Implementation}
%\subsection{Graph Framework}
%We build our graph framework on top of the open-source Java Unified
%Network/Graph Framework, adding

%\subsection{Web Application}
%We have written a web interface 

\section{Results}
The raw scoring method outperforms both hits and the community-weighted 
measure in our informal user study on approximately a dozen URLs.  A representative 
selection is listed below.
\subsection{Precision}
To evaluate precision we selected four well-connected blogs from
varying topic areas. Precision is expressed in terms of percentage of
relevant hits in the top 20 results, as decided by the authors.
\newline

%\begin{table}
\begin{tabular}{|c|c|c|c|}
\hline
Blog URL & Raw \% & Comm \% & HITS  \%\\
\hline
\texttt{sandysknitting.com} & 65 & 55 & 25 \\
\texttt{busymom.net} & 80 & 55 & 60 \\
\texttt{talkleft.com} & 90 & 65 & 80 \\
\texttt{photojunkie.ca} & 75 & 40 & 50 \\
\hline
\end{tabular}
%\caption{Accuracy results}
%\end{table}

\section{Discussion}
As the results above indicate, the raw score calculation provides the
most accurate identification of relevant blogs.  Disappointingly,
however, the community-weighted measure performed much worse than
either HITS or the original raw scoring measure.  While it did
effectively eliminate the over-representation of globally popular
nodes from the result set, it also penalized blogs that were
legitimately related to the query but had a broad cross-topical
appeal.  For example, a photography blog that also covered
technology-related issues would appeal to members of both photography
and technology communities, and thus the relative in-degree from
members of either community exclusively is diminished.  HITS
frequently identified several of the same top-ranking sites as the raw
scoring measure, but often diluted the result set with highly popular
but less directly relevant results.

\subsection{Shortcomings}
In addition to the over-representation of blogs with high global
in-degree, one important shortcoming of this mechanism is that it
relies upon outbound links from the query blog to identify the
candidate set of relevant URLs.  Unfortunately, most of the hugely
popular blogs on the internet exhibit a high level of in-degree and
very low levels of out-degree linking~\cite{herring2005cba}. Thus our
algorithm has difficulty generating relevant results for these blogs.

\section{Conclusions}

%We have developed an algorithm for ranking blogs in the link
%neighborhood of a query blog such that the resulting ordering
%approximates their semantic similarity to the query.

The FOAF algorithm performs most effectively on blogs that focus on a
specific topic within a tight community, and performs worst on blogs
that are exceptionally globally popular.  An effective
means to eliminate the over-representation of popular blogs from the
result set would be a desirable direction for future work, however we
did not find it difficult to identify meaningfully-related
blogs in spite of the inclusion of these results. Overall we find the
FOAF technique to be very effective at finding new and interesting
blogs, and will continue to use it to guide our media location efforts
into the future.

%\section{Future Work}
%We plan to do a double-blind user study, aimed at answering the
%following questions:
%\begin{enumerate}
%\item How well does link structure predict semantic structure,
%  regardless of the specific algorithm
%\item How well does our algorithm fare against similar algorithms,
%  e.g. those derived from HITS

%\end{enumerate}

%ACKNOWLEDGMENTS are optional
%\section{Acknowledgments}

%
% The following two commands are all you need in the
% initial runs of your .tex file to
% produce the bibliography for the citations in your paper.
\bibliographystyle{abbrv}
\bibliography{web-mining}
% You must have a proper ".bib" file
%  and remember to run:
% latex bibtex latex latex
% to resolve all references
%
% ACM needs 'a single self-contained file'!
%

%\balancecolumns % GM MARCH 2007
% That's all folks!
\end{document}
